@kcons
Member

@kcons kcons commented Oct 29, 2025

This approach lets the backfill run to completion and be tracked without creating a significant task backlog.
The first phase is a one-time creation of status rows, which runs in a task loop.
From there, we can start triggering the coordinator periodically and have it schedule tasks for work items, very slowly at first to verify, then in increasing volume.
We can ensure that the capacity cost of this backfill stays relatively fixed regardless of processing rate, and failed tasks can naturally be rescheduled without starving out tasks that might succeed.
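
To make the coordinator idea concrete, here is a minimal in-memory sketch of a single coordinator pass. It is not the code in this PR: the status values, `StatusRow`, `run_coordinator`, `max_in_flight`, and `stale_after` are all hypothetical names, and treating a long-unchanged in-progress row as failed is an assumption about how rescheduling might work.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

# Hypothetical status values; the PR's actual values are not shown here.
NOT_STARTED = "not_started"
IN_PROGRESS = "in_progress"
COMPLETED = "completed"


@dataclass
class StatusRow:
    detector_id: int
    status: str
    date_updated: datetime


def run_coordinator(
    rows: list[StatusRow],
    max_in_flight: int,
    stale_after: timedelta,
    now: datetime,
    schedule: Callable[[int], None],
) -> list[int]:
    """One pass: recycle stale items, then top up to max_in_flight."""
    # Assumption: an in-progress row that has not been touched for
    # `stale_after` has failed. Re-queue it with a fresh timestamp so it
    # goes to the back of the queue and cannot starve untried items.
    for row in rows:
        if row.status == IN_PROGRESS and now - row.date_updated > stale_after:
            row.status = NOT_STARTED
            row.date_updated = now

    # Only schedule enough new work to reach the concurrency cap, so the
    # capacity used is bounded no matter how often the coordinator runs.
    in_flight = sum(1 for row in rows if row.status == IN_PROGRESS)
    budget = max(0, max_in_flight - in_flight)

    pending = sorted(
        (row for row in rows if row.status == NOT_STARTED),
        key=lambda row: row.date_updated,
    )
    scheduled = []
    for row in pending[:budget]:
        row.status = IN_PROGRESS
        row.date_updated = now
        schedule(row.detector_id)
        scheduled.append(row.detector_id)
    return scheduled
```

In this sketch, raising the processing rate only means raising `max_in_flight`; the coordinator itself does a bounded amount of work per run.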

This PR is structured as a framework plus a job that uses it, even though we may never actually need to reuse the framework. The split makes it easier to review the job management and the error backfill elements separately; not having the separation is less code, but ultimately a larger conceptual chunk. By separating it, we also have the option of doing more bulk processing of this sort.

Process-wise, we'd:

  1. Add a job to create the bulk job status chunks, run it, and validate.
  2. Add a cron trigger for the coordinator task, targeting at most 1 run at a time.
  3. Verify that the low-rate processing is working as intended, then gradually increase the concurrent job count.
  4. Depending on burn-down rate and timeline, tweak the job count to something sustainable that won't backlog our cluster, and set up a dashboard to show how soon we'll be done.
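
For the "how soon we'll be done" figure in step 4, a dashboard could use a simple linear projection from the recent completion rate. The helper below is a hypothetical sketch, not part of this PR; `estimate_completion` and its parameters are illustrative names, and it assumes the rate observed in the window continues unchanged.

```python
import math
from datetime import datetime, timedelta
from typing import Optional


def estimate_completion(
    remaining: int,
    completed_in_window: int,
    window: timedelta,
    now: datetime,
) -> Optional[datetime]:
    """Project a finish time from the completion rate seen in `window`."""
    if remaining <= 0:
        return now
    if completed_in_window <= 0:
        # No progress in the window, so no meaningful projection.
        return None
    # remaining / (completed_in_window / window_seconds), rearranged so the
    # division happens last.
    seconds_left = math.ceil(
        remaining * window.total_seconds() / completed_in_window
    )
    return now + timedelta(seconds=seconds_left)
```

A projection like this is only as good as the window it is based on, so it is most useful once the concurrent job count has settled.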

Once done, we can delete and drop the table, or we can leave it for reuse.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 29, 2025
@github-actions
Contributor

This PR has a migration; here is the generated SQL for src/sentry/workflow_engine/migrations/0094_add_error_backfill_status.py

for 0094_add_error_backfill_status in workflow_engine

--
-- Create model ErrorBackfillStatus
--
CREATE TABLE "workflow_engine_error_backfill_status" ("id" bigint NOT NULL PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY, "date_updated" timestamp with time zone NOT NULL, "date_added" timestamp with time zone NOT NULL, "status" varchar(20) NOT NULL, "detector_id" bigint NOT NULL UNIQUE);
ALTER TABLE "workflow_engine_error_backfill_status" ADD CONSTRAINT "workflow_engine_erro_detector_id_6e5eb8d9_fk_workflow_" FOREIGN KEY ("detector_id") REFERENCES "workflow_engine_detector" ("id") DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "workflow_engine_error_backfill_status" VALIDATE CONSTRAINT "workflow_engine_erro_detector_id_6e5eb8d9_fk_workflow_";
CREATE INDEX CONCURRENTLY "workflow_engine_error_backfill_status_status_3d9773bb" ON "workflow_engine_error_backfill_status" ("status");
CREATE INDEX CONCURRENTLY "workflow_engine_error_backfill_status_status_3d9773bb_like" ON "workflow_engine_error_backfill_status" ("status" varchar_pattern_ops);
CREATE INDEX CONCURRENTLY "errbkfl_stat_upd_idx" ON "workflow_engine_error_backfill_status" ("status", "date_updated");
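
The composite index on ("status", "date_updated") looks suited to a query that picks the least recently updated rows in a given status. The sketch below illustrates that access pattern with Python's built-in `sqlite3` rather than PostgreSQL; the simplified table, the status strings, and both function names are assumptions for illustration and are not taken from the migration or the PR's code.

```python
import sqlite3


def make_status_table() -> sqlite3.Connection:
    """Create a simplified in-memory stand-in for the status table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE error_backfill_status (
            id INTEGER PRIMARY KEY,
            date_updated TEXT NOT NULL,
            date_added TEXT NOT NULL,
            status TEXT NOT NULL,
            detector_id INTEGER NOT NULL UNIQUE
        )
        """
    )
    # Mirrors the column order of errbkfl_stat_upd_idx in the migration.
    conn.execute(
        "CREATE INDEX errbkfl_stat_upd_idx "
        "ON error_backfill_status (status, date_updated)"
    )
    return conn


def oldest_with_status(
    conn: sqlite3.Connection, status: str, limit: int
) -> list[int]:
    """Return detector ids in `status`, least recently updated first."""
    rows = conn.execute(
        "SELECT detector_id FROM error_backfill_status "
        "WHERE status = ? ORDER BY date_updated LIMIT ?",
        (status, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Filtering on the index's leading column and ordering by its second column is the shape of query such an index can serve without a separate sort step.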

@getsantry
Contributor

getsantry bot commented Nov 20, 2025

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you remove the label Waiting for: Community, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Nov 20, 2025